ABSTRACT

The Open Data movement around the globe is demanding more
access to information that lies locked in public or private
servers. As recently reported by a McKinsey publication,
this data has significant economic value, yet its release has the
potential to blatantly conflict with people's privacy. Recent
UK government inquiries have shown concern from various
parties about the publication of anonymized databases, as there
is a concrete possibility of user identification by means of linkage
attacks. Differential privacy stands out as a model that
provides strong formal guarantees about the anonymity of 
the participants in a sanitized database. Only recent results 
demonstrated its applicability on real-life datasets, though. 
This paper covers such breakthrough discoveries, by review- 
ing applications of differential privacy for non-interactive 
publication of anonymized real-life datasets. Theory, util- 
ity and a data-aware comparison are discussed on a variety 
of principles and concrete applications. 

Categories and Subject Descriptors 

H.4 [Information Systems Applications]: Miscellaneous; 
H.2 [Database Management]: Database Applications -
Statistical databases
1. PRIVACY-PRESERVING DATA PUBLISHING
1.1 Motivation

In a recent report by McKinsey [31] it is estimated that in
the developed economies of Europe alone, government administration
could save more than 100 billion euros (149 billion dollars)
in operational efficiency improvements by
leveraging big data. This term refers to the enormous quan- 
tity of information organizations around the globe collect 
daily. In particular, public institutions retain data about 
many aspects of our life, including medical, fiscal, trans- 
portation and criminal records. Private companies are also 
increasingly taking a bigger role in our private life by record- 
ing our Internet searches, friends network, financial transac- 
tions and transportation habits. Not everybody knows how 
to handle this information properly, though. The UK, the leading
European country in terms of Open Data, recently held a
consultation [39] with public institutions and industry representatives
to discuss data publishing issues. Various parties
expressed clear concern about privacy, prompted in
part by notorious privacy breaches that occurred in
the past. In 1997 Latanya Sweeney showed that 87% of American
citizens can be uniquely identified just by knowing their
gender, ZIP code and birthdate. To prove the point,
she obtained this data from the public voting records of
Massachusetts and linked it with supposedly "anonymized"
hospital records of public employees. She then sent the
Governor of Massachusetts his own medical records. The quest
for true anonymization in the field of Privacy Preserving 
Data Publishing (PPDP) began, and it is still not over. In 
2006 Internet provider AOL released its search log containing
3 months of searches by 650,000 users. Usernames were
masked with random identifiers; still, in a matter of days, a
New York Times reporter identified Thelma Arnold, a 62-year-old
widow from Lilburn, GA, as user #4417749 [1], and
her queries became known to the world. As a consequence of 
releasing this private dataset the CTO of AOL resigned, two 
employees were fired and a class action lawsuit is pending. 
Later the same year, Netflix, a DVD rental company, released
a perturbed version of one tenth of its database of movie ratings
expressed by its customers. A prize of $1,000,000 was
offered to whoever improved by 10% the accuracy of the
company's own recommendation algorithm. The following
year the researchers Narayanan and Shmatikov showed it was
possible to identify users by linking them to IMDb, a public
database of movie ratings in which users can voluntarily
publish their ratings [35]. These concerns prevented Netflix
in 2010 from launching a follow-up to the prize.



1.2 Solutions 

Analysts want to have precise answers to queries about 
data, which can be sensitive. In the so-called interactive set- 
ting, information is protected inside a database handled by 
the data owner, and access to it is allowed only through an 
interface. Answers provided by the interface are processed 
in such a way to guarantee the anonymity of the partici- 
pants in the database. There are two main problems with 
this approach. Suppose we have a database about HIV pos- 
itive people containing their gender, ZIP code and birthdate, 
along with a numerical id and many other attributes. We
might consider the identity of a patient at risk if anyone in
the world learns at least three of his attributes at the same
time. Then we might allow analyst A to know, e.g., the gender
and ZIP code of person #1, and analyst B the ZIP code and the
birthdate of the same person. The system can answer both
questions, but privacy in this model is safe only as
long as the analysts do not share information. If we wish to protect
the data against collusion of data consumers, then as soon as two
attributes about a given person have been revealed to anybody we
must disallow queries for all the remaining attributes. This
can soon prevent the system from answering any query from new
analysts. For example, if we allow each analyst to know
at most one attribute per person, then, since future queries are
unknown, some analyst may never take advantage
of his right to know one attribute for a given tuple,
thus wasting information others might be interested in. In
the non-interactive setting these problems are addressed by 
releasing once and for all the data which we think is of interest
to most analysts, while still preserving privacy. Naturally
the example above is simplistic, and with this paper we intend
to show that a wealth of useful information can be published
while formally maintaining strong privacy guarantees. Over
the years, several solutions to the problem of protecting
privacy in anonymized databases have been proposed.
Examples are k-anonymity [38], l-diversity [30] and t-closeness
[27]. All these methods assume it is worthwhile to divide
data attributes into three groups: identifiers (e.g. name, surname),
quasi-identifiers (e.g. ZIP code, gender, age) and
sensitive attributes (e.g. pathology, rented adult movie). In legal terms,
in the EU the Data Protection Directives [16] define personal
data as 'information related to an identified or identifiable
natural person'. It is quite a general definition:
for example, even a house value can be classified as personal
information, as it might reveal its owner's income. Recently,
the European Data Protection Supervisor (EDPS) expressed
its concerns [15] about a proposal on the re-use of Public Sector
Information (PSI) previously adopted by the European
Commission [14]. In particular it was recommended that

"Where appropriate, the data should be fully or 
partially anonymised and license conditions should 
specifically prohibit re-identification of individu- 
als and re-use of personal data for purposes that 
may individually affect the data subjects." 

The purpose limitation is a difficult issue to solve in a con- 
text where PSI is put on the Internet for everybody to see. 
European transgressors who try to identify persons whose 
data is contained in a published anonymized dataset may 
be fined, but how to deal with non-European ones? Also, 
how is it possible to measure the degree of anonymization 
of a given dataset in order to decide if it is too risky to 
be published on the Internet? For example, the UK Office
for National Statistics is going to release data collected in
2011 anonymized with a record-swapping system [40], which
involves selecting households that are deemed too identifiable
and swapping them with other households that are
not too far away in the same geographical region and have similar
values. Tables containing origin-destination data are considered
too hard to anonymize satisfactorily, so they
are licensed only to restricted users. What is the theoretical
basis for this distinction, if any? The EDPS calls for a
'proactive approach' to be taken by authorities,
meaning privacy issues should be analyzed at the earliest
stages and the people involved informed throughout the whole
data release process. The linkage attacks described above demonstrate
how quasi-identifiers can be used to significantly increase the 
accuracy in identity disclosure, making the distinction with 
identifiers purely artificial. Also, sometimes the sole fact of 
knowing somebody is or is not in a database may provide 
a malicious user with valuable information to carry out an 
attack. So, how do we achieve so-called privacy by design,
in which a data release process is devised to prevent disclosure
with formal guarantees? To respond to these issues the concept
of differential privacy was introduced by Dwork [10] to
prevent attackers from detecting even the presence
or absence of a given person in a database. Differential
privacy falls in the category of so-called perturbative methods,
which attempt to create uncertainty in the released
data by adding some random noise. If database participants
are independent of each other, differential privacy
promises that even if an attacker knows everything about
every user in the database but one, by looking at the published
statistics he won't be able to determine the identity of the
remaining individual. Kieron O'Hara, in his 2011 independent
transparency and privacy review for the UK government
[36], mentions differential privacy as a cutting-edge technology
that judges whether the computation performed by the anonymization
algorithm is privacy-preserving or not, rather than trying
to make an impossible distinction between identifying
and non-identifying data. This might sound promising, but
O'Hara claims differential privacy appears to be limited to 
the interactive setting. Is this really true? Recent results in 
the non-interactive setting are encouraging. In what follows, 
we formalize some concepts about differential privacy. 

1.3 Basic definitions 

We use $P(A)$ to indicate the probability of the occurrence
of event $A$ and define $\|x\|_1$ as the sum of the absolute values
of all elements in vector $x$.

Definition 1 (database). Given a database universe
$\mathcal{D}$ we define a database $D \in \mathcal{D}$ as a multiset of $|D|$ tuples from
a universe $\mathcal{U}$, each with $k$ attributes. We say two databases
$D_1, D_2$ are neighbors if they differ in one tuple. We indicate
this condition as $|D_1 \Delta D_2| = 1$.
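The neighbor relation of Definition 1 can be illustrated with a minimal sketch. It assumes databases are represented as Python lists of hashable tuples and reads $|D_1 \Delta D_2|$ as the size of the multiset symmetric difference; the helper names are hypothetical and only serve the illustration.

```python
from collections import Counter


def symmetric_difference_size(d1, d2):
    """Return |D1 delta D2| for two databases given as iterables of hashable tuples."""
    c1, c2 = Counter(d1), Counter(d2)
    # Counter subtraction keeps only positive counts, so the two one-sided
    # differences together count every occurrence present in exactly one multiset.
    return sum(((c1 - c2) + (c2 - c1)).values())


def are_neighbors(d1, d2):
    """Two databases are neighbors when they differ in exactly one tuple."""
    return symmetric_difference_size(d1, d2) == 1


# Hypothetical records: (gender, ZIP code, birth year).
original = [("F", "02139", 1970), ("M", "02139", 1985), ("M", "30047", 1944)]
without_last = original[:-1]
print(are_neighbors(original, without_last))  # True
```

Under this reading, replacing one tuple with another yields a symmetric difference of size 2, so only the addition or removal of a single tuple produces neighboring databases.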

2. DIFFERENTIAL PRIVACY 

Randomized algorithms to publish sensitive data are called
mechanisms. Since we are addressing the problem of statistical
disclosure at large, we use $\mathcal{R}$ to denote a wide range of
output possibilities for the mechanism designers, whose goal
is to devise a mechanism function $M : \mathcal{D} \to \mathcal{R}$. One possible
choice of $\mathcal{R}$ could be $\mathcal{D}$ itself, meaning we either
release a new database composed of synthetic individuals
who hopefully follow the same distribution as the original
participants, or publish a perturbed version of the original
database, with real data randomly modified to satisfy
differential privacy criteria. Another possible and popular
choice of $\mathcal{R}$ is the set of queries $q_j$ counting how many
individuals $u_i$ satisfy a given property $j$. To be
$\epsilon$-differentially private, a mechanism must satisfy the following
definition, first introduced by Dwork [12], which in recent
years has become popular among researchers in the field of
statistical disclosure:

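Definition 2 ($\epsilon$-DP). Given a randomized mechanism
$M : \mathcal{D} \to \mathcal{R}$ and a real value $\epsilon > 0$, we say $M$ satisfies
$\epsilon$-differential privacy if $\forall D_1, D_2 \in \mathcal{D}$ such that $|D_1 \Delta D_2| = 1$
and $\forall R \subseteq \mathcal{R}$ the following inequality holds:

$$P(M(D_1) \in R) \le e^{\epsilon} P(M(D_2) \in R)$$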

Differential privacy guarantees the following: a data release
mechanism is $\epsilon$-differentially private if, for any input
database, any participant $u$ in the database, and any possible
output $r$ of the release mechanism, the presence or
absence of participant $u$ (in database terms, $D_1$ and $D_2$ differing
in one row) causes at most a multiplicative $e^{\epsilon}$ change in
the probability of the mechanism outputting $r$. For example,
if we want to release the count of people with HIV from
a hypothetical medical database, we must devise a mechanism
$C$ that, when executed on databases differing in one
person, outputs the same result with similar probability. We can build such
a mechanism by first counting the persons with a counting
function $c : \mathcal{D} \to \mathbb{N}$ and then adding some noise to the result. If
the noise follows the Laplace distribution [12], outputs
close to the true count are exponentially
more likely than values far from it (see Fig. 1). To determine the
amount of noise to add we must first introduce the concept
of global sensitivity:

Definition 3 (global sensitivity of a function).
We define the global sensitivity $\Delta(f)$ of a function $f : \mathcal{D} \to
\mathbb{R}^n$, $n \in \mathbb{N}^+$, as

$$\Delta(f) = \max_{D_1, D_2 \in \mathcal{D},\ |D_1 \Delta D_2| = 1} \|f(D_1) - f(D_2)\|_1$$

A function has low sensitivity if the addition or removal of
one person to or from the database can change the outcome of
the function evaluation only by a small amount. The so-called
Laplace mechanism $\mathcal{L}$ works in fact for any numerical function
$f : \mathcal{D} \to \mathbb{R}^n$ we want to compute on our database, but
there is a catch: the amount of noise we must add is linked
to the global sensitivity of $f$. If the value of $f$ changes a lot
between a database $D_1$ and a neighboring database $D_2$,
we need to add more noise for the two outputs to remain
almost equally likely. For the single counting function $c$ the global
sensitivity is low ($\Delta(c) = 1$) and thus the noise to add is
limited.
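The counting example can be sketched in a few lines of Python. The sketch assumes the standard calibration from the literature, which the text above does not spell out: noise drawn from a Laplace distribution with scale $\Delta(f)/\epsilon$. The function names, the sample data and the choice $\epsilon = 0.5$ are illustrative assumptions.

```python
import math
import random


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_pdf(x, mean, scale):
    """Density of the Laplace distribution, used below to inspect the guarantee."""
    return math.exp(-abs(x - mean) / scale) / (2.0 * scale)


def laplace_count(database, predicate, epsilon, rng=None):
    """Release a noisy count of the rows satisfying `predicate`."""
    if rng is None:
        rng = random.Random()
    true_count = sum(1 for row in database if predicate(row))
    sensitivity = 1.0  # global sensitivity of a single counting query
    return true_count + laplace_noise(sensitivity / epsilon, rng)


# Hypothetical medical records: (patient id, HIV positive?).
patients = [(1, True), (2, False), (3, True), (4, False), (5, False)]
print(laplace_count(patients, lambda row: row[1], epsilon=0.5))
```

Because neighboring databases shift the true count by at most 1, the ratio `laplace_pdf(x, c, 1/epsilon) / laplace_pdf(x, c + 1, 1/epsilon)` never exceeds $e^{\epsilon}$ for any output `x`, which is the inequality of Definition 2 for this mechanism.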

2.1 Differential privacy weaknesses

2.1.1 Relaxations 

Noise introduced by the randomization can produce results
far from the true ones, thus leading to scarce utility of
the published output for data consumers. Many relaxations
of differential privacy exist to address this problem, and the
major one is $(\epsilon, \delta)$-differential privacy:



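Definition 4 ($(\epsilon, \delta)$-dp [11]). Given a randomized mechanism
$M : \mathcal{D} \to \mathcal{R}$ we say $M$ satisfies $(\epsilon, \delta)$-differential privacy
if $\forall D_1, D_2 \in \mathcal{D}$ such that $|D_1 \Delta D_2| = 1$ and $\forall R \subseteq \mathcal{R}$ the
following inequality holds:

$$P(M(D_1) \in R) \le e^{\epsilon} P(M(D_2) \in R) + \delta$$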

There are no hard and fast rules for setting $\epsilon$ and $\delta$. The choice is
generally left to the data releaser, and usually $\delta$ is taken
to be very small, $\delta < 10^{-4}$. $(\epsilon, 0)$-dp is the same as $\epsilon$-dp.
Among the other relaxations we mention $(\epsilon, \delta)$-probabilistic
differential privacy ($(\epsilon, \delta)$-pdp) [29]. A mechanism satisfying
$(\epsilon, \delta)$-pdp also satisfies $(\epsilon, \delta)$-dp, but the converse does not
hold.
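One common way to obtain an $(\epsilon, \delta)$ guarantee is the Gaussian mechanism [11]. The sketch below is not taken from the surveyed papers: it assumes the textbook calibration $\sigma = \Delta_2 \sqrt{2 \ln(1.25/\delta)} / \epsilon$, which is stated in the literature for $0 < \epsilon < 1$, where $\Delta_2$ is the sensitivity measured in the Euclidean norm. Function and parameter names are hypothetical.

```python
import math
import random


def gaussian_sigma(l2_sensitivity, epsilon, delta):
    """Noise standard deviation for the textbook Gaussian mechanism calibration."""
    if not 0.0 < epsilon < 1.0:
        raise ValueError("this calibration is stated for 0 < epsilon < 1")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    return l2_sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


def gaussian_mechanism(values, l2_sensitivity, epsilon, delta, rng=None):
    """Add independent Gaussian noise to every coordinate of a numeric answer."""
    if rng is None:
        rng = random.Random()
    sigma = gaussian_sigma(l2_sensitivity, epsilon, delta)
    return [v + rng.gauss(0.0, sigma) for v in values]


# Illustrative parameters only: delta is chosen below the 10^-4 rule of thumb.
print(gaussian_mechanism([120.0, 45.0], l2_sensitivity=1.0, epsilon=0.5, delta=1e-5))
```

Smaller values of $\delta$ increase the noise only logarithmically, which is one reason why very small $\delta$ values are affordable in practice.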

2.1.2 Is differential privacy good enough ? 

Some authors argue that even differential privacy is not enough to
adequately protect individuals from data disclosure. Kifer
and Machanavajjhala in [23] point out that differential pri- 
vacy really works only if individuals are truly independent 
from each other. When there is no independence, the participation
of somebody in the database can be inferred just by looking
at other entries that are known to the attacker and related to the
"victim". As a consequence, they claim we are forced to
take into consideration adversarial knowledge, even if differ- 
ential privacy apparently freed us from such a burden. From 
a practical point of view, Dankar and El Emam [8] address 
several issues of differential privacy in the context of health 
care. They point out the lack of real-life deployments of differentially
private datasets, which might make it difficult
to assign responsibility if privacy breaches occur (was
the $\epsilon$ value appropriate? Who else has successfully used such an
$\epsilon$? etc.). It might also be difficult to explain the level of
anonymization guaranteed to patients, as $\epsilon$ is a parameter
of a formula that is quite theoretical in nature. Furthermore, since
published data is obtained through randomization, it may sometimes
look hard to believe: e.g. a randomized census
dataset may indicate there are people living at the center of
a lake. As a consequence, analysts might be led to mistrust
the approach (or whoever applied it).

2.2 Mechanisms 

The two main mechanisms are the already described Laplace
mechanism [12] and the exponential mechanism [32]. The former is
used when the output is numerical, the latter when
outputs are not real-valued or make no sense after adding noise.
Other mechanisms are the matrix mechanism of Li et al. [26], the
geometric mechanism (a discretized version of the Laplace
mechanism) by Ghosh et al. [17] and the Gaussian mechanism
[11].
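For non-numerical outputs, the exponential mechanism [32] can be sketched as follows. The sketch assumes the usual formulation from the literature, in which a candidate output is sampled with probability proportional to $\exp(\epsilon \cdot score / (2 \Delta))$, with $\Delta$ the sensitivity of the score function. The score function, the candidate list and the data are hypothetical.

```python
import math
import random


def exponential_mechanism(database, candidates, score, sensitivity, epsilon, rng=None):
    """Sample one candidate with probability proportional to exp(eps * score / (2 * sens))."""
    if rng is None:
        rng = random.Random()
    scores = [score(database, candidate) for candidate in candidates]
    best = max(scores)
    # Subtracting the best score avoids overflow and leaves the probabilities unchanged.
    weights = [math.exp(epsilon * (s - best) / (2.0 * sensitivity)) for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]


def frequency(database, value):
    """Score of a candidate: how many records equal it (sensitivity 1)."""
    return sum(1 for record in database if record == value)


# Hypothetical diagnoses; the released value is a category, not a number.
records = ["flu"] * 30 + ["cold"] * 25 + ["asthma"] * 2
print(exponential_mechanism(records, ["flu", "cold", "asthma"], frequency, 1.0, 0.5))
```

With a small $\epsilon$ the answer is close to a uniform draw over the candidates, while a large $\epsilon$ almost always returns the true most frequent value.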

3. MEASURING UTILITY 

Broadly speaking, the utility of a mechanism is its capability
to minimize the error, which is a measure of the
distance between the original input database (or statistics on it)
and the noisy output database (or statistics). In the non-interactive
setting, utility can be guaranteed only for restricted classes of queries [2].
Blum, Ligett, and Roth [2] showed that in this setting it
is possible to answer exponentially sized families of counting
queries, so in this paper we will mostly look at solutions
for publishing data that are useful for such queries. However,
the choice of suitable statistics is a difficult problem, as
these statistics need to mirror the sufficient statistics of the applications
that will use the sanitized database, and for some
applications the sufficient statistics are hard to characterize.
Popular approaches to measuring utility are $(\alpha, \beta)$-usefulness
[2], relative error with correction for small queries [42, 4]
and without correction [7, 43], absolute error [6, 9, 28], variance
of the error [7, 42, 9] and Euclidean distance [28, 19]. In the
following, we define them more precisely.

Definition 5 ($(\alpha, \beta)$-usefulness [2]). A privacy mechanism
$M$ is $(\alpha, \beta)$-useful for queries in class $C$ if with probability
$1 - \beta$, for every $Q \in C$ and every dataset $D \in \mathcal{D}$, for
$\tilde{D} = M(D)$,

$$|Q(\tilde{D}) - Q(D)| \le \alpha$$
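An empirical counterpart of this definition can be sketched with a Monte Carlo estimate. The sketch makes two simplifying assumptions that the definition does not: it fixes a single database instead of quantifying over every dataset, and it uses a toy mechanism that adds Laplace noise to each cell of a histogram. All names are hypothetical.

```python
import random


def estimate_failure_rate(mechanism, database, queries, alpha, trials=1000, seed=0):
    """Fraction of runs in which some query errs by more than alpha on this database."""
    rng = random.Random(seed)
    true_answers = [query(database) for query in queries]
    failures = 0
    for _ in range(trials):
        sanitized = mechanism(database, rng)
        worst_error = max(abs(q(sanitized) - t) for q, t in zip(queries, true_answers))
        if worst_error > alpha:
            failures += 1
    return failures / trials


def noisy_histogram(histogram, rng, epsilon=1.0):
    """Toy mechanism: add Laplace(1/epsilon) noise to every histogram cell."""
    rate = epsilon  # an exponential with rate epsilon has mean 1/epsilon
    return [c + rng.expovariate(rate) - rng.expovariate(rate) for c in histogram]


histogram = [12, 0, 7, 3]
# Query k sums the first k + 1 cells (a prefix range count).
prefix_queries = [lambda h, k=k: sum(h[:k + 1]) for k in range(len(histogram))]
beta_hat = estimate_failure_rate(noisy_histogram, histogram, prefix_queries, alpha=5.0)
print(f"estimated beta for alpha = 5: {beta_hat:.3f}")
```

The returned fraction is only an estimate of $\beta$ for the chosen database and query class; it does not replace a formal proof of usefulness.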

It is adopted in [2], [43] (only for a basic cell-based algorithm),
and [3]. $(\alpha, \beta)$-usefulness is effective at giving an
overall estimation of utility, but according to [5] it fails to provide
intuitive experimental results. [5, 4, 42] experimentally
measure the utility of sanitized data for counting queries by
relative error, adopting this formula:

Definition 6 (relative error). Let $Q$ be a query and
$M : \mathcal{D} \to \mathcal{R}$ a privacy mechanism, with $\tilde{D} = M(D)$. We denote
the relative error as

$$rel(Q) = \frac{|Q(\tilde{D}) - Q(D)|}{\max\{Q(D), s\}}$$

where $s$ is a sanity bound that mitigates the effects of queries
with excessively small selectivities. In both [42] and [5] $s$ is
set to 0.1% of $|D|$.
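The effect of the sanity bound can be seen in a short sketch. It assumes the relative error takes the form used in the literature, with the denominator $\max\{Q(D), s\}$, and that $s$ is 0.1% of the number of tuples; the function name and the numbers are illustrative.

```python
def relative_error(true_answer, noisy_answer, sanity_bound):
    """Relative error of a counting query, damped by a sanity bound for small counts."""
    return abs(noisy_answer - true_answer) / max(true_answer, sanity_bound)


dataset_size = 100_000
s = dataset_size / 1000  # 0.1% of the dataset size

# A query with small selectivity: the bound replaces the tiny true answer.
print(relative_error(20, 45, s))
# A query with large selectivity: the bound plays no role.
print(relative_error(5000, 5050, s))
```

Without the bound, the first query would report an error of 25/20 = 1.25 even though the absolute error is small compared with the dataset size.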

When the database is considered as a vector of reals (so
$k = 1$, $A_1 = \mathbb{N}$) the Euclidean distance can be used as a utility
measure. Li et al. [28] measure the error as the Euclidean
distance between the original and the noisy database, $Err(D) =
\|D - M(D)\|_2$, claiming that in this way their mechanism can
guarantee utility for any class of queries. Hardt
et al. [19] measure the Euclidean distance between query
responses.

4. METHODS 

Several methods have been proposed to address the issue
of releasing differentially private data. Broadly speaking,
they can be divided into the categories of histogram construction,
sampling and filtering, partitioning, and dimensionality
reduction. The notation $\tilde{O}$ indicates complexity with hidden
logarithmic factors.

4.1 Computing histograms 

A histogram is a disjoint partition of the database points
together with the number of points that fall into each partition.
Publishing a noisy version of the histogram is appealing because
of its usefulness for counting queries. However, the
quality of queries executed on the histogram may be low. If
a query requires the sum of $n$ histogram cells, each
of them carries its own noise, so the noise adds up $n$ times
and can quickly become intolerable. Another issue regards
the cardinality $|\mathcal{U}|$ of the universe. As pointed out by [6], any data with
several attributes $A_i$ leads to huge contingency matrices of size
$\prod_i |A_i|$. Among the works suffering from this problem we
find [13, 9, 42, 43, 20, 26]. [42] applies a transform to
the counts and adds noise in the wavelet domain in time
$O(|\mathcal{U}| + |D|)$, and similar techniques via post-processing with
overlapping information are suggested in [20]. Li et al. [26]
generalize the last two approaches with the introduction of the
matrix mechanism, which generates an optimal query strategy
based on the query workload of linear count queries. No efficient
algorithm is provided, though. One possible solution
to the histogram problem is to take advantage of the sparsity
of data present in many databases. This condition occurs
when the number of cells $|\mathcal{U}^+|$ with positive count in the
contingency table of the database at hand is much smaller
than the number of zero-valued entries. To quantify this, Cormode et al.
[6] define the density $\rho$ as $\rho = |\mathcal{U}^+| / \prod_i |A_i|$.

Table 1

Dataset               Density $\rho$
OnTheMap [41]         3-5%
Census Income [34]    0.4-4%
UCI Adult Data [24]   0.14%

Figure 1
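The ratio defined in [6] can be computed directly from the data, as in the following minimal sketch. It assumes the database is a list of attribute tuples and that the domain size of each attribute is known in advance; the function name and the toy data are hypothetical.

```python
from math import prod


def density(rows, domain_sizes):
    """Ratio between occupied contingency-table cells and all possible cells."""
    occupied_cells = len(set(rows))   # cells with a positive count
    total_cells = prod(domain_sizes)  # product of the attribute domain sizes
    return occupied_cells / total_cells


# Toy data with three attributes whose domains have 2, 100 and 50 values.
rows = [(0, 17, 3), (0, 17, 3), (1, 42, 8), (1, 99, 0)]
print(density(rows, (2, 100, 50)))  # 3 occupied cells out of 10,000
```

Note that only the occupied cells need to be stored to compute this ratio, whereas a naive noisy histogram has to materialize all of the possible cells.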

Table 1 shows that many natural datasets
have low density, in the single-digit percentage range or less.
Applying differential privacy naively generates output which
is $1/\rho$ times larger than the data size. In the above examples,
$1/\rho$ ranges from tens to thousands, which is clearly not
practical for today's large data sizes. Among the methods
which exploit data sparsity we find [6, 28, 3]. In [3] this
definition of $m$-sparse queries is proposed:

Definition 7 ($m$-sparse query [3]). We say that a linear
query $Q$ is $m$-sparse if it takes non-zero values on only
$m$ universe elements, and that a class of queries is $m$-sparse
if each query it contains is $m'$-sparse for some $m' \le m$.

4.2 Sampling and filtering 

For the sampling and filtering category the idea is to avoid
publishing huge contingency tables by filtering out entries
with small counts, which are numerous
in many databases. Cormode et al. [6] adopt a variety of
filtering techniques (high-pass filtering and priority sampling
being the most useful) to avoid the costly operation
of materializing a complete noisy contingency table.
Their method is suited for sparse datasets. For search log
analysis, [25] and [18] propose mechanisms that release
noisy aggregated counts of user queries and clicked URLs by
filtering out excessively small counts. However, such approaches
break the association between distinct query-URL pairs in the
output, since all the user IDs are removed, so the output might be
useful in only a few applications. Therefore, in [21] a sampling
method is proposed to allow analysis in exactly the
same fashion and for the same purpose as the original data.
However, $(\epsilon, \delta)$-pdp is adopted to provide formal guarantees,
because relaxations are indispensable in search log publishing,
as proven in [18].
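The filtering idea can be illustrated with a deliberately naive sketch: it adds Laplace noise to every cell of the contingency table, including the empty ones, and publishes only the cells whose noisy count reaches a threshold. This is exactly the costly materialization that the techniques of [6] are designed to avoid, so the code only shows what the filtered output looks like, not how it is obtained efficiently. Names, threshold and data are hypothetical.

```python
import random
from itertools import product


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def naive_highpass_release(counts, domains, epsilon, threshold, rng=None):
    """Noise every cell of the contingency table, keep those at or above the threshold.

    counts  -- dict mapping a cell (tuple of attribute values) to its positive count
    domains -- one list of admissible values per attribute
    """
    if rng is None:
        rng = random.Random()
    released = {}
    for cell in product(*domains):  # naive: visits empty cells too
        noisy = counts.get(cell, 0) + laplace_noise(1.0 / epsilon, rng)
        if noisy >= threshold:
            released[cell] = noisy
    return released


# Toy table over two binary attributes; two of the four cells are empty.
counts = {(0, 0): 40, (1, 1): 2}
domains = [[0, 1], [0, 1]]
print(naive_highpass_release(counts, domains, epsilon=1.0, threshold=5.0))
```

Empty cells must receive noise as well, otherwise the mere presence of a cell in the output would reveal that it is occupied. The thresholding step is post-processing of the noisy table and does not weaken the guarantee.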

4.3 Partitioning 

Partitioning is indicated for ordered attributes such as
spatial data. As in algorithms computing histograms, the
universe $\mathcal{U}$ is divided into regions, but in this case the shape
of the cells is not fixed and an attempt is made to find an
optimal subdivision of the space. Regions may overlap.
The goal is to optimize the results of range queries,
where the analyst asks for the number of people lying within
a given query area, usually expressed as a hyperrectangle.
This calculation involves the sum of already published noisy
counts, so a strategy that allows the user to minimize the total
noise variance must also be provided. A popular approach
to partitioning is with kd-trees: at each round, an attribute
is chosen and the points in the database are split according to some
criterion. Usually the split is made at the median, so that the
number of points is uniform on both sides of the splitting line.
Noisy counts of the two newly formed partitions are
then published and partitioning is done recursively. The idea
of differentially private data-partitioning index structures is
suggested in the context of private record matching in [22].
The approach there is based on using an approximate mean
as a surrogate for the median (on numerical data) to build kd-trees.
The approach of [43] imposes a fixed-resolution grid
over the base data. It then builds a kd-tree based on noisy
counts in the grid, splitting nodes which are not considered
'uniform', and then populates the final leaves with 'fresh'
noisy estimated counts. Quadtree partitioning simply imposes
a recursive fixed grid in which at each round the space
is divided into four rectangular cells of the same size. In
[7] several median-finding methods,
Hilbert R-trees and quadtree partitioning are compared, and the
privacy budget is allocated in a geometrically increasing way
to counts during the partitioning of 2D data. Attention is
devoted to post-processing the noisy counts to achieve consistency
and minimum error variance in time linear in the
size of the published tree. Quadtree partitioning is found to
be fast and superior in output quality to all the other
tested methods.
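A quadtree with noisy counts can be sketched as follows. This is a simplified illustration, not the algorithm of [7]: the budget grows geometrically from the root to the leaves with a ratio of $2^{1/3}$ that is assumed here for the example, and the post-processing step that enforces consistency is omitted. All names are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def level_budgets(epsilon, height, ratio=2 ** (1 / 3)):
    """Split epsilon over depths 0..height, growing geometrically toward the leaves."""
    weights = [ratio ** depth for depth in range(height + 1)]
    total = sum(weights)
    return [epsilon * w / total for w in weights]


def build_quadtree(points, box, height, budgets, rng, depth=0):
    """Recursively split `box` into four equal cells and attach a noisy count to each node."""
    x0, y0, x1, y1 = box
    node = {
        "box": box,
        # Each point falls in one cell per level, so a level has sensitivity 1.
        "count": len(points) + laplace_noise(1.0 / budgets[depth], rng),
        "children": [],
    }
    if depth < height:
        xm, ym = (x0 + x1) / 2.0, (y0 + y1) / 2.0
        quadrants = [(x0, y0, xm, ym), (xm, y0, x1, ym), (x0, ym, xm, y1), (xm, ym, x1, y1)]
        buckets = [[], [], [], []]
        for x, y in points:
            buckets[(1 if x >= xm else 0) + (2 if y >= ym else 0)].append((x, y))
        node["children"] = [
            build_quadtree(bucket, quadrant, height, budgets, rng, depth + 1)
            for bucket, quadrant in zip(buckets, quadrants)
        ]
    return node


rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(500)]
tree = build_quadtree(points, (0.0, 0.0, 1.0, 1.0), 3, level_budgets(1.0, 3), rng)
print(round(tree["count"], 1), [round(c["count"], 1) for c in tree["children"]])
```

The grid is fixed in advance and does not depend on the data, so only the counts consume privacy budget; the per-level budgets add up to $\epsilon$ by sequential composition. The printed child counts generally do not sum to the root count, which is the inconsistency that the post-processing in [7] removes.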

4.4 Dimensionality reduction 

Dimensionality reduction methods usually consider the
database as a matrix and apply random projections to it.
In this line of research we find [3], which describes, for the class of
linear counting queries that are $m$-sparse, a method based
on releasing a perturbed random projection of the private
database together with the projection matrix.
Running time is polynomial in the database size $|D|$, $m$, and
$\log |\mathcal{U}|$. In [45] compression is applied to obtain a reduced
synthetic database $D'$ of size $|D'| \ll |D|$ in polynomial time.
Li et al. [28] apply compressive sensing to obtain a perturbed
database from sparse data through decompression in time
$\tilde{O}(|D|)$.

5. APPLICATIONS 

In recent years differential privacy has been successfully
applied to a wide range of real-world data, although generally
with no quality assessment by the final users of the anonymized
datasets. In [29] $(\epsilon, \delta)$-pdp is introduced to model spatial
data. This solution is then compared by Cormode with
his own work in [6]. In Y. Xiao et al. [43] a kd-tree technique
is applied to CENSUS data [34], and the results are found to be
superior to the hierarchical tree method of Inan et al. [22]. Moreover,
the open source HIDE platform [44] is provided to
experiment with four differentially private algorithms: [20,
22, 42, 5]. Cormode later found in [7] that his algorithm
gives less error than those of Inan [22] and Y. Xiao [43]. In
[5] the MSNBC [33] and STM [37] datasets, represented as set-valued
boolean data, are considered. The only comparison is
performed for MSNBC against the basic noisy datacube method
of Dwork [12], as STM has a big universe size $|\mathcal{U}|$ and few
methods are capable of handling this situation. The STM dataset,
represented as sequences of locations, is also considered in [4],
although neither location coordinates nor time intervals are taken
into account. In [18] the publication of counting queries for
search logs is considered, but the origin of the dataset is not specified.
In [21] the AOL search log [1] is adopted for experimental
tests. [42] performs experiments on CENSUS data [34], using
binning to obtain $|\mathcal{U}| \approx 16{,}000{,}000$.

6. SYNTHETIC DATABASES 

There have been few attempts to devise mechanisms of
the kind $M : \mathcal{D} \to \mathcal{D}$, because privacy in these cases is
more difficult to preserve. Outputs can be either a synthetic
database, in which individuals follow the same distribution
as in the original database, or just a perturbed version,
where rows are taken directly from the original database
with some modification to guarantee anonymity. Perturbed
database release is considered in [5, 4, 28]. Synthetic data
is released with the methods proposed in [45, 29, 21].

7. CONCLUSIONS 

Differential privacy provides the formal guarantees that public
opinion needs when privacy is at stake, yet for many years
such requirements were judged by researchers too strict to be
applicable. Recently, several breakthrough results changed
this mood. We presented a variety of methods (partitioning,
dimensionality reduction, sampling and filtering) which
have been successfully applied to many real-life datasets.
Some methods were also shown for histogram publishing,
which, albeit unfeasible on certain datasets with a big universe
size, can still be used in practice on some real-life
datasets. Most of the papers we discussed use a plain
$\epsilon$-dp model, which seems to indicate relaxations may not
really be needed except in problematic cases like search log
publishing. Differential privacy can be applied efficiently
with formal guarantees to set-valued data [5], to sparse data
for counting queries [3] and for general purpose queries [28].
When data is not sparse and $|\mathcal{U}|$ is not too big, [42] can be
used with success. For the difficult case of search log publishing,
Hong et al. [21] showed it is even possible to publish
a perturbed database while maximizing utility. Less formal
guarantees but good practical results are provided efficiently
in [4] for sequences of short length and by Cormode [6] for discrete
data. Good results were obtained for two-dimensional
spatial data in [7]. For these reasons the time is ripe for the
Open Data movement to start considering the adoption of
differential privacy and to provide people with adequate guarantees
about the way their data is handled. Research
still has to be done to impose constraints on output data in order
to avoid inconsistencies, to properly anonymize high-dimensional
non-sparse data and to preserve the utility of general
classes of queries. In this regard, publication of synthetic
or perturbed datasets seems a promising approach, which
needs careful examination of query utility.